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Abstract 

We develop a weighted local likelihood estimate for the parameters that govern the local spatial 
dependency of a locally stationary random field. The advantage of this local likelihood estimate is that it 
smoothly downweights the influence of far away observations, works for irregular sampling locations, and 
when designed appropriately, can trade bias and variance for reducing estimation error. This paper starts 
with an exposition of our technique on the problem of estimating an unknown positive function when 
multiplied by a stationary random field. This example gives concrete evidence of the benefits of our local 
likelihood as compared to naive local likelihoods where the stationary model is assumed throughout a 
neighborhood. We then discuss the difficult problem of estimating a bandwidth parameter that controls 
the amount of influence from distant observations. Finally we present a simulation experiment for 
estimating the local smoothness of a local Matern random field when observing the field at random 
sampling locations in [0, l] 2 . 

1 Introduction 

Stationary random fields play a fundamental role in both theoretical and applied spatial statistics. Unfor- 
tunately, stationarity is often violated when working with real data. The issue is more pressing with the 
recent data deluge and high resolution sensing where non-stationarity can be clearly visible. This presents 
a challenge for the spatial statistician who is interested in estimating and modeling dependency structure 
in random fields. 

Even though stationarity is often criticized as being too simplistic, we believe it is still important for 
the understanding and development of nonstationary models. The reason is that any type of statistical 
estimation requires some sort of replication to "average over" . In spatial statistics the data often comprise of 
one realization of a random field. In this case, the assumption of stationarity provides a type of replication 
that makes statistical estimation possible. For nonstationary random fields, however, the lack of any 
assumption leads to a breakdown in statistical estimation due to the absence of replication. This problem 
can be mitigated by adding assumptions on the nonstationary random field like local stationarity, for 
example. The idea is that on small enough spatial scales, one hopes that the local dependency of the 
random field near a point is well approximated by some stationary random field. In this paper, we do 



not attempt a precise definition of local stationarity (see [3] for a definition in the time series literature). 
Instead, we take it for granted that such random fields exist (however one defines it) and enter into a 
discussion of how one might estimate the parameters of a local stationary approximation when observing 
a single realization of the nonstationary random field at dense (possibly uneven) observation locations. To 
accomplish these goals we develop a local likelihood approach for estimating local parameters. 

Before we continue we mention why the standard local likelihood techniques fail for estimating the local 
dependency of a random field. In current techniques (see J5], for example), the independence structure of 
the data typically allow one to decompose the log-likelihood as a sum. Each summand depends on one 
data point, which can be down weighted as a function of some spatial covariate. In the random field 
case, however, there is no independence and therefore no such decomposition of the log-likelihood. A 
different technique that does work for random fields is to simply divide the observation locations into 
neighborhoods and fit a stationary random field model on each neighborhood. Typically two problems 
arise with this approach. First, the range of validity of a stationary approximation can be too small to 
contain enough data to estimate it. Second, it can produce non-smooth local parameter estimates, which 
can be undesirable in many cases. 

In this paper we present an exposition through computation, simulation and some theory of our version 
of local likelihood estimation. A large portion of the paper is devoted to the discussion of different ways 
of constructing and estimating the weights used in our local likelihood that downweights the influence 
of distant observations of the random field. We start in Section [2] with our definition of a weighted local 
likelihood and then immediately apply it to the problem of variance modulation in Section [3] This example 
is convenient since the local likelihood estimate has a closed form and one can derive the Bayes risk under 
a polynomial prior for o~(x). In Section [4] we study the problem of estimating the bandwidth parameter A 
from data. Finally, in Section [5] we apply these techniques to the estimation of the local fractional index 
of a local Matern random field when observing one realization of the field at uneven observation locations 
in R 2 . 

2 Weighted Local Likelihood 

Before we present our notion of local likelihood it will be advantageous to set some notation. We write 
a random field as {Z(t): t € M d } or just Z when the domain of definition is clear from context. We 
distinguish two types of random field models: global and local. The only real distinction is nonstationary 
versus stationary but the nomenclature is useful since we regard global models as the true nonstationary 
sampling distribution and local models as the local stationary approximations. We consider global models 
that are nonstationary random fields indexed by some nonparametric function 6{t) that takes spatial 
arguments t £ M rf and returns values in some m-dimensional parameter space C M m . We call the 
function #(•) the local parameter function and denote resulting global model Ggr.y For each fixed to £ 
M. d , the parameter vector 6q = 6{to) determines a local random field model, denoted by Lg , which is 
generally stationary and models the stochastic behavior of Z near some point to. Informally this means 
that CiL 9o {Z(to + h): \h\ < e} « CjG g ^{Z(to + h): \h\ < e} when e is small (&L 8o denotes the law of 
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the finite dimensional distributions under model Lg Q ). For the remainder of the paper a local parameter 
function will be denoted by #(•), 9(t) or just 9 and a particular value in the parameter space O C M. m will 
be written 9q. 

For a concrete example, let Z be a random field on M d such that there exists a fixed but unknown 
function <j(t) : M d — > M + and a known stationary random field W such that Z(t) = a(t)W(t). In this case, 
a(t) is the local parameter function and G CT (.) denotes the true distribution of Z. If c(to + h) ~ o"o when 

is small then the local model for Z at to, denoted by L UQ , is just the law of o~oW(t). Presumably if one 
has enough data in a neighborhood of to one could successfully estimate a (to) by fitting the model L aQ 
to the data. Notice there is an inherent bias-variance tradeoff when fitting L ao locally to data: increasing 
the size of the local neighborhood reduces the variability of the estimate but increases the bias due to the 
inaccuracy of the stationary approximation. It is with local likelihoods that we seek to balance these two 
competing terms by smoothly down weighting the dependence of far away observations. 

To define the local likelihood estimation of #(•) suppose we have n observations (ti, z\), . . . , (t n , z n ) of a 
single realization of a random field Z (the spatial locations are tj £ M. d and the responses are Zj = Z(tj)). 
For any given location t € M. d , let "Nth denote the set of k observations nearest to t. Given weights 
w±, . . . , w n (possibly depending on t, tk and a bandwidth parameter A) we define the following weighted 
local likelihood: 

n 

W A (0 o ,t|data) 4 ^ Wk [£(Ji tjk \L 6o ) - l^^L^)) (1) 
fe=l 

where l(Nt fc 1-^00 ) is the log likelihood of the data in 3sf t fc under the model Lq (note: we define £(Nt,o\Lo ) 
to be 0). Now our local likelihood estimate of 8(t) is defined as 

0\(t) = are max W\(#o> tldata). 

o eR m 

We first remark that if the weights Wk = 1 for all k, then the telescoping sum in ([TJ collapses and one 
recovers the full likelihood for the local stationary model. Indeed W\ orders and subsequently downweights 
the incremental changes in the stationary likelihood when adding the observations one by one in order of 
their distance to t. An important feature of the estimate 9\ is that, at least for Gaussian random fields, 
the computational cost of W\(9q, i|data) is comparable to that of ^(#|data) by either up dating a Cholesky 
decomposition or down dating an inverse covariance matrix. Moreover, the estimate 9\(t) will typically 
be a "smooth" function of t (when the weights are smooth) even when there is no natural additive or 
multiplicative structure on the parameter space C M m . Finally we remark that 9\ depends on the 
bandwidth parameter A through the weights which typically have the form w k = K((t — t k )/X) for 
some smoothing kernel K. 

3 Variance modulation 

We start with a particularly simple example, already mentioned in the previous section: estimating the local 
variance of a Gaussian random field. Consider the estimation of the function o~(t) : M. — > M + when observing 
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Figure 1: Left: The simulated a{t)W{t) were W is a mean zero Gaussian random field with Matern 
autocovariance function with parameters (a 2 ,v,p) = (1,0.8,0.2) and a(t) = 2 sin(t/0.015) + 2.8. Middle: 
The increments of a(t)W(t) on the 200 evenly spaced observation locations in the interval [0,0.1]. Right: 
Plot of the true local parameter function <r 2 (t) (dashed) along with two estimates of <r 2 (i), one using smooth 
weights (blue) and the other using hard thresholding weights (i.e. naive local likelihood estimation). 



a{t)W{t) where W a mean zero Gaussian process on K with known Matern parameters. This example is 
useful since the local maximum likelihood estimate has a closed form solution. We take advantage of this 
solution by deriving the Bayes risk under polynomial priors. The Bayes risk is then used to give concrete 
evidence of a bias and variance trade for reducing estimation error. We also use the Bayes risk to explore 
the relationship between higher order kernels, bias of the resulting estimates and optimal bandwidth. 

To derive the closed form solution of the local likelihood estimate we first need some notation. Let 
Sfc denote the covariance matrix of the observations in the k neighborhood of t, i.e. 3Mt and let Zk — 
(zi, . . . , Zk) T denote the responses of the data in "Nt,k- Notice that 



-1_ 
k Zk 



log |crgS fc | 



k 



and therefore 



2a 2 



log(27T) 



n j n 

W A (a ,t|data) = - log a ^ w k + ^ w k \ 

k=l k=l 



z k-l^k-l Z k-l 



z T y- 1 

Z k ^k 



+ C. 



The maximum occurs at 



~2 A 
A = 



z I^k z k 



Z k -i^ k -l Z k- 



(2) 



En 
k=l w k 

which is the local maximum weighted likelihood estimate of a 2 

Figure [T] shows an example of this local likelihood estimate (blue line in the right-hand diagram) as 
compared with naive local likelihood estimation (red line in the right-hand diagram) where the stationary 



at t. 
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model is presumed to hold through the local neighborhood. The left most diagram in Figure [T] shows 
the simulated field a(t)W(t) at 200 evenly spaced observation locations in [0, 1], where If is a mean zero 
Gaussian random field with known Matern autocovariance function with parameters (a 2 , v, p) = (1, 0.8, 0.2) 
(using the parameterization given on page 50 of [T3|) and unknown a(t) = 2sin(i/0.015) + 2.8. Notice that 
even though the local variance changes significantly throughout the observation region, it is difficult to see 
since the range parameter p = 0.2 is relatively large compared with the observation region. However, if one 
looks at increments of a(t)W(t), shown in the middle diagram, the change in local variance becomes clearly 
visible. The true local parameter function <J 2 (t) is shown in the right hand plot of Figure [l] (dashed). The 
smooth weights take the form Wk = K((t — tk)/X) where K is defined as K(t) = e _t / 2 (15 — 10i 2 + 1 4 )/8 



which is kernel Kq presented in [T7] (and also used in Section 3.1 below). The bandwidth used for both 
estimators is obtained by minimizing the Kullback-Leibler divergence of the estimated global model G&*r.) 
to the truth at the observation locations. Notice that using smooth weights, the local likelihood estimate 
can not only reduce estimation error but can also yield smooth estimates of the local parameter function 
a 2 . 

Remark: Although the computation of each S^ 1 in Q can be very time consuming, there is a down- 
dating algorithm that makes all the inverses obtainable in nearly the same computational time as E" 1 
(where n is the number of observations). Once the inverse S^ 1 is computed, the downdate ^~u-\ is easily 
computable by the following formula: 

b\ + k 




where k = c — b T T,7_J). Therefore 

ii, , Sr x (l : fc-l.AOErV&.l : k-1) 

S fc (k, k) 

where S^" 1 (a: b,c:d) denotes the sub-matrix of S^ 1 of rows a through b and columns c through d. Note 
that the estimates cr 2 {t) and <r 2 (s) at two different locations i / s are very similar: the only difference is 
the order one downdates the row and column of S^ 1 . This means one can construct a 2 at may different 
spatial locations by computing S" 1 once and for all, then downdate the inverse in different row and column 
orders. 



3.1 Bayes risk under polynomial priors for er(-) 

Now we study the behavior of <j\{to) at some fixed spatial location to- We think of the parameters that 
govern the deviation of cr(t) from a(to) as nuisance parameters and model this situation by supposing a 
has the following finite taylor expansion 

N 

cr{t) = co + Y,c P {t-to) k . (3) 
p=i 
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The zero order coefficient, cq, is the parameter of interest and the higher order coefficients, c p for p > 0, 
are nuisance parameters. Now using a prior 7r(c) for the nuisance coefficients c = (c±,. .. , c/v) we derive 
the Bayes risk of <7\(io) under L2 loss. For simplicity we suppose vr(c) = 7Ti(ci) • • • ttn{cn)i E n .Cj = and 
E nj c^ = for all j > 1. One of the goals of this section to attempt to understand the relationship between 
these nuisance parameters and using higher order kernels that are orthogonal to polynomials of order at 
most N. The other goal is to study the relationship between bias, risk, bandwidth and parameters of the 
stationary random field. 

Let (ii, . . . , tk) be the observation locations for z k = (z±, . . . , z n ) T so that Zj = a(tj)W(tj). To make the 
notation clearer we suppose the observations are ordered by their distance to to, so that Nt ,3 = {zi, Z2, 23} 
for example. Now zj~ has covariance function A a T,kA a where 

N 

A a 4 diag( ( 7(t 1 ), . . . , a(t n )) = £ c k A? (4) 

p=0 

and A£ = diag[(ti — to) p , ... ,(t n — to) p ]. To derive expressions for the bias and mean square error for 
<7^(£o)j notice that 

N 

E[zl^ k l z k \c] = Etr^E-'lc] = tr[A CT S fc A CT S fc X ] = £ c Pl c P2 tr [S^A^Af ] (5) 

Pl,P2=0 

and 

cov^S-^fe, zjVfzjlc] = covfzJS-^fc, z^ET^ fc z fc | c ] = trp- 1 A (T S fe A CT S-^ fe A^A,] (6) 

= 2 £ ^•••^[S^AfS fc AfST^AfS fe Af] (7) 

pi,...,p 4 =0 

where k > j and S~^ fe — ( E Q is the matrix S" 1 padded with zeros so that it has the same size as 
E^" 1 . At this point it becomes convenient to re-write <5^(to) as Z)fc=i z k^k lz k where 

~ a J(wjfc - Wfc+i)/D"=iWj> if^<^; 
\^n/I]™ = i^j, iffc = n. 

Now the expected value of a^(to), conditional on c = (c±, . . . , cjy), is 

n n N N 

E[al{t G )\c]=Y J ^kE[z T k ^ k 1 z k \c]=Y J E tB^tfetr^AfEfcA^ ] = £ ^%B^> (8) 

fc=l fe=lpi,P2=0 Pl,P2=0 
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where 4 £jj =1 ^ 

■[al(t )\c] = var 



. The variance is 



var 



= ^ w k WjZov[yl^ k x z k ,y]^. l Zj ] (9) 
fc=i J j,fc=i 

n Af 

2 E E ^% c Pl --- c P4tr[S fc - 1 A^S fc AfST A ^. vfc AfS fc Af] (10) 
j,fc=i pi,...,p4=o 

AT 

£ Cp, ■ • • CaB**-** (11) 

pi,...,p 4 =0 



where 4 2 ££ fc=1 t^tr [s^A^Af S7^. vfe Af E fc Af ^ 

under the prior distribution for the nuisance parameters can be decomposed into two quadratic terms 



. Now the expected squared bias 



^[bias 2 ] = E T [E[a 2 (t )\c] - a 2 (t )} 2 = E T 



A' 



Pl,P2=0 
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Pl.P2 = l 


P =l 
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+ 4c 2 ) £ 7r 






Pl,P2 = l 




p=i 



(12) 
(13) 
(14) 



since B°'° = Y2k=l = ^ an< ^ ^°' P = Note that the cross term in (14) is zero by the assumption 



E nj Cj = for all j > 1. Therefore the Bayes risk is 



EtvE 



a 2 (t )-a 2 (t )] 2 



N 

EE \C C C C 1 RPl'P2iP3,P4 
-^TT L^Pl C P2 °P4 J 

Pl,P2 ,P3 ,P4=0 
AT 

+ ^ E n [cp x Cp 2 Cp 3 Cp 4 j _B Pl B^ 3 

P1,P2 ,P3,P4 = 1 

\N 

+ Ac 2 E^ Y, c p B ° ,P 
p=i 



(15) 
(16) 

(17) 



Notice that the above expression decomposes the Bayes risk into three quadratic terms. The last term 
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B 



0,p 



(17), which comes from the bias, can be rewritten as 

n n 

w k tr [A p k ] = Wktr [diag((ii - t ) p , ...,(t k - t ) p )] 

k=l 

n 



k=l 



k=l 



w 



- to y + ... + (t k -t )p] 

w k (t k - t ) p . 



n 

3 k=l 



(18) 
(19) 
(20) 



Therefore, if w k is constructed by w k = if ((to — £fc)M); where L K(t)dt = 1 and J K K(t)t p dt = for some 
p > 1, then under mild conditions on the sampling locations t k and K 



b o,p 



E 



J fa 

A 



to) 







as n — > oo, A — ► and the t k s get more dense in a bounded region near to- In particular, if one uses higher 
order kernels, the last term (17) can be made arbitrarily small under infill asymptotics. Note: it is only 
necessary to consider A — > when our observation locations t k stay bounded in a compact domain and K 
has infinite support (which is the case we will consider in this section). 



% improvement in Bayes risk 



% improvement in Pias 





Figure 2: The percent improvement in Bayes risk (left) and squared bias (right) over hard threshold when 
using an oracle bandwidth selector and weights generated from kernel Kq. The rows of the heat maps 
correspond to different values of v and the columns to p for the Matern autocovariance function. 
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Figure 3: Bayes risk (right) and expected bias squared (left) of a?(l/2) plotted against bandwidth for 5 
different kernels: K 2 , K4, Kq, Kg and hard thresholding. The estimate <t|(1/2) is based on 150 observation 
locations in [0, 1] and the Matern autocovariance function with v = p = 0.8. 



Numerical Results For exposition we consider a class of Gaussian-based higher order kernels K 2r 
defined in [T7] as 

K 2r (t) 4 Q 2r _ 2 (t)ct>(t) (21) 

where 4>(t) = 'I 2 / '\^2tt is the Gaussian kernel, Q 2r -2 = {2 r_1 (r — l)l}~ 1 H 2r -i(t)/t, and Hj denotes the 
jth n0 rmalized Hermite polynomial defined by HAt) = (— iy (p^\t) / 4>(t) . These kernels have the required 
property that J K K{t)t p dt = for all < p < 2r. Of course these are not the only such kernels but we use 
them since the resulting estimates cr\{t) will be very smooth in t. Finally we compare all our results to the 
the hard thresholding weights defined as Wk = l_B A (t)(ifc) 5 where 1a is the indicator of the set A £ R and 
B\(t) is the ball of radius A centered at t. The hard thresholding weights yield the naive local likelihood 
estimate where the local stationary model is presumed to hold everywhere in the local neighborhood. 

Now we use the above derivation of Bayes risk to numerically investigate the improvement of local 
likelihood estimation over hard thresholding. We computed the Bayes risk and bias terms in equation (15) 

under the assumption that t$ = 1/2, er(to) = 2, and Cj *~ N(0,4) for j = 1,...,4 (so that N = 4 in 
^ ) . The heat maps in Figure [2] show the percent improvement over hard threshold when using kernel Kq 
defined in (21 ) to generate the weights Wk- The bandwidth for both Kq and hard thresholding were chosen 
using the Kullback-Leibler oracle criterion. The rows of the heat maps correspond to different values of v 
and the columns to p for the Matern autocovariance function. The observation locations for which the local 
likelihood estimate <j\ is based on are 100 evenly spaced points in [0, 1). First notice that there is a 17% 
improvement over hard thresholding uniformly over the possible values of v and p. An interesting feature 
of the leftmost diagram is that local likelihood estimates do better as the random field gets smoother. A 
possible explanation is that the smoother the random field the larger the neighborhood required to attain 
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a sufficient variance reduction. For these large neighborhoods the higher order kernels have significantly 
less bias. Finally, we mention that the right-hand diagram shows an improvement in expected bias squared 
that can reach almost 68%. 

In Figure [3] we plot the Bayes risk and expected bias squared as a function of bandwidth for the 
estimate ct?(1/2) using the kernels K2, K4, Kq, Kg and hard thresholding. The estimate is based on the 
Matern random field with parameters v = p = 0.8 with 150 even observation locations in [0,1]. We use 
the same polynomial prior as in the last paragraph. Notice the significant improvement in risk and bias 
squared using kernels K4, Kq and K%. Also notice that the improvement in risk and bias occurs at larger 
bandwidths for higher order kernels. 



4 Bandwidth selection 

The local likelihood estimates 9\ can vary dramatically depending on the choice of smoothing parameter 
A. Since the theoretical derivation of Bayes risk in anything but the simplest setting is extremely difficult 
it is necessary to develop estimates of A from data. Classical methods such as cross-validation may fail 
dramatically when the data is comprised of a single realization of a nonstationary random field. The data 
is very highly correlated so that a "leave out" prediction is problematic since the data "left out" is highly 
correlated with the data "left in" . In this section we give a heuristic for constructing a reasonable estimate 
A. We then present our interpretation of this heuristic in terms of two different estimates of A. At the end 
of this section we present some numerical simulations to illustrate their behavior. We make no claim of 
optimality or any theoretical justification other than heuristics. 

Our heuristic for the bandwidth estimator says that one should choose the bandwidth to maximize the 
spatial variability in 9\ beyond what is expected from the realization of the random field itself. The idea 
is that there are two sources of spatial variability in 9\. The first is the spatial variability from the true 
local parameter function 9 which, of course, 9\ is trying to estimate. The second is the spatial variation 
coming from the particular realization of the random field itself. Notice that the smaller the bandwidth 
A the more the spatial variation in 9\ is due to the random field realization, and less to the variability of 
9. To describe our mathematical interpretation of this heuristic we need some notation. Let l ?(9\) be a 
measure of spatial variation in 9\. For example, when the local parameter 9 is univariate so the 9{t) maps 
ff 1 to 1 a natural choice might be y{9\) — Jjgd |V#a| 2 - Let 9 be the MLE of the local parameter function 
obtained by assuming the global model is stationary (i.e. by assuming 8(t) is a constant function of t). Now 
consider simulating an independent realization of the data under the stationary random field model using 
the estimated parameter 9. Let EgT^x) denote the expected value of 7(9\) where 9\ is applied to the new 
realization of the stationary random field. The idea is that Eq7 , (9x) quantifies the spatial variation in 9\ 
that is exclusively due to the random field itself (since the true local parameter function is constant). Now 
we estimate A as follows: 

Ai=argmax ? . 22) 



In equation ( 22 ) we divided by sdg y{9\) , which denotes the standard deviation of l P(9\) under the stationary 
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model given by 9, to improve comparison across A. 

Our second bandwidth selector is similar to the first, the main difference being that the measure of 
spatial variation T(9 X ) is replaced with ^2 t (VJ x (9 x (t),t) — W x (9,t)). This is recognized as a type of local 
likelihood ratio test statistic (summed over the spatial variable t) that compares the estimate 9 X to the 
stationary fit 6. Then after adjusting by the expected behavior of this quantity under the stationary fit 9 
we get the following estimate of bandwidth: 



A2 = arg max ■ 
A 



Et(Wx(9x(t),t)-W x (9,t))-E d 


Et(Wx(9 x (t),t)-W x (9,t))~ 


sd 9 


Y, t (W\0x(t),t)-W x (9,t)) 





(23) 



In our implementation of both ( 22 ) and ( 23 ) we use simulations to estimate the expected value and standard 



deviation under the stationary fit 



estimates of Che local variance 



criterion for selecting bandwidth 



Using bandwidth estimate 1 
Using bandwidth estimate 2 
oracle 
truth 





0.02 0.03 
bandwidth 



Figure 4: Left: a 2 (dashed), cr| (blue), <r? (green) and & Xotc (dotted) when observing a single realization 
of a(t)W(t), where a(t) = 2 sin(i/0.015) + 2.8 is unknown and W is a stationary Gaussian random field 
with known Matern parameters (a,iy,p) = (1,0.5,0.5), at 1000 even sampling locations in [0,0.1]. Right: 
Plots of the standardized criterion profiles which, when maximized, give A orc (dotted), Ai (blue) and A2 
(green). 



To illustrate Ai and A2 we provide some simulations in the case of variance modulation (see Section 
[3]). Our first simulation is a single realization of a(t)W(t) at 1000 evenly spaced sampling locations in the 
interval [0,0.1] where a{t) = 2 sin(i/0.015) + 2.8 is unknown and W is a mean zero stationary Gaussian 
random field with known Matern autocovariance parameters (a, v, p) = (1, 0.5, 0.5). The left plot of Figure 

S shows the true local parameter function cr 2 (t) (dashed line) along with three different estimates a\ 
Xi 

(blue), a? (green) and a\ (dotted line). The estimate a\ is the oracle estimate, which estimates the 
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smoothness parameter by 

Ao^argmax^G-HG^)- 1 }, (24) 

where G a 2 denotes the true global model for a(t)W(t) at the observation locations and is Kullback- 

Leibler divergence. The right plot of Figure [4] shows the criterion profiles that are maximized for estimating 
the bandwidths Ai (blue), A2 (green) and A orc (dotted). The profiles in this diagram are standardized by 
their average and standard deviation so they can be compared on the same scale. The parameters of our 
second simulation are exactly the same with the exception that now v = 1 instead of v = 0.5. The results 
of this simulation are shown in Figure 5 where again, the left plot shows a 2 (dashed), <r 2 (blue), <r? 

(green) and ^\ orc (dotted). The right plot, again shows the standardized profiles for estimating A. 

For both simulations, Ai and A2 do a good job of estimating an appropriate bandwidth. The modes of 

the data-driven criterion profiles seem to come reasonably close to at least one of the modes in the oracle 

profiles. It is unclear to us, at this moment, why the oracle profile in both simulations have two modes. 

Regardless, we think it is a good sign that the data-driven profile modes are close to at least one oracle 

mode. Moreover the resulting estimates o\ and o\ give good agreement with the truth a 2 , although some 

1 2 

under-fitting can be seen in Figure |4j Notice that both of these simulations use a large range parameter 
p = 0.5 as compared to the size of the observation region [0,0.1]. This was intentional and designed to 
mimic the infill asymptotic regime of letting A — > as one gets denser observations. 
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5 Estimating a local fractional index in 2 dimensions. 



In this section we take full advantage of our local likelihood technique and estimate a spatially varying 
smoothness parameter when observing one realization of the field at random observation locations in [0, l] 2 . 
In our simulation we suppose that there is a priori information that the observed random field is locally 
isotropic with known local variance and range that do not vary spatially but with an unknown spatially 
varying smoothness parameter, denoted v% [t E M 2 ). There are many nonstationary random fields that 
have spatially varying smoothness parameter (see [TT], [16], [3] and [13] for example). For our simulation 
we use a nonstationary local Matern, presented below that has a closed form covariance, can attain any 
degree of differentiability or Holder smoothness and behaves locally like a stationary Matern. We include 
an appendix that proves the positive definiteness of our covariance function (which originally appeared in 
the technical report |15j). 

We start by deriving our nonstationary covariance structure with spatially varying local parameters. 
Let M l/ (-) = | • \ u % u {-) where % v is the modified Bessel function of the second kind and let u^, at and 
at denote spatially varying Matern parameters. The local parameter function vt : M ' — ► M + determines 
the local smoothness, a 2 : M. d — > M + determines local variance and a t : R d -> PD d (R) deter mines a local 



geometric anisotropy which maps M. d into the set of d x d positive definite matrices with real entries. Now 
define the following covariance function 



R(s,t) = a t a s det(a st 1/2 )M Ust (\aj /2 (t - s)\j 



(25) 



where a s t — (at + a s )/2, v s t = (vt + v s )/2. In the appendix we show that the nonstationary covariance 
function R is positive definite. 

One problematic feature about R is that it is difficult to separate the interpretation of the local scale 

^ /2 — 1/2 

det(a t ) and the smoothness parameter ut. This is because both det(a t ) and vt effect the local lag 



for which the correlation becomes close to zero. Therefore it is desirable to re-parameterize (25) to give 



distinct interpretations of the three local parameter functions: variance, range and smoothness. To simplify 
the exposition we suppose there is no geometric anisotropy so that at maps t G M. d into the set of positively 
scaled d x d identity matrices. In particular let pt : M. d — » M + and define 



K(t,s) 



r(i/ a )2"--i 



1/2 



r> t )2 



v t -\ 



1/2 



PI 



+ 



8vt 



-d/2 



+ 



8u t 



-1/2 



The advantage of this covariance function (which is a re-parameterization and simplification of (25)) is 
that when both s,t are near some fixed to, and the local parameter functions vt, pt and of are sufficiently 
smooth, we have that 



K(s,t) 



at 



Hi1s-*IMo)- 



This is recognized as an isotropic Matern autocovariance with parameterization found on page 50 of |14j . 
Therefore cr 2 is the local variance, vt is the local smoothness, and pt has the interpretation of the local 
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range. Now our a priori information on the observed random field amounts to the supposition that pt = P 
and Gt = o are constant functions of t 6 R d so that K simplifies to 



K(s,t) 



d/A d/A 



a 



d/2 



(r(^ s )2^- 1 ) 1 /2(r(^)2^- 1 ) 1 /2 



2y/V s V t 



PW^st 



(26) 



When estimating a local parameter function in two dimensions, bias becomes prominent near the 
boundary of the observation region. In our simulation we attempted to design the local likelihood weights 
to automatically mitigate this bias near the boundary. We can loosely motivate our weights by the theory 
of estimating equations. Notice that if the local parameter function 9 is univariate (so that it maps M rf 
into R) then J^W A (0 o , i|data) = Y%=i w kS' k (0o) where S k (9 ) = log fe {z k \H k -i,t) is the conditional log 
likelihood of the /c th nearest observation to t conditional on k — 1 nearer observations. Therefore the local 
likelihood estimate 9(t) is defined as solving 



5> fc S£(0(t)) = O 



k=i 

where both S k and w k depend on t. This suggests to design the weights w k to minimize varg [Ek=i w kS' k (9o)] 
subject to the unbiasedness constraint Y2 k =i w kEg{S k {9o)) = where 9{t) = 9q. Notice that Eg and var# 
denote expected value and variance with respect to the true nonstationary model Ggt.y To exclude the 
solution w\ = • • • = w n = we fix the scale J2k=i w k = 1- Now if we ignore the covariance of the cross 
terms cov(w k S k (6o),WjS'j(Qo)) (which are zero under the stationary model since S[(9o) , . . . , S' n (9o) then 
forms a martingale difference sequence) one wishes to design the weights w k to minimize ^ vaTo[S' k (9o)] 
subject to YIk=i w k E e (S' k {9o)) = 0. 

First notice that the quantity l/vaxo[S k (6o)] can be interpreted as the information that the data z k 
provides for 9q, conditional on the k — 1 nearer observations to t. In the fully nonstationary case we 
stipulate that l/varg[S' k (9o)] will decay to zero as t k — > oo and depend predominantly on the nuisance 
parameters that govern the deviation of 9{t k ) from 9{t) (e.g. on the coefficients of order > 1 of the 
Taylor expansion of 9{t k ) at t). For a particular estimation problem it may be possible to use prior 
knowledge on the scale of spatial variability of the local parameter function 9 to attempt to get some 
understanding of how l/varg[S' k (9o)] behaves as a function of t k . In our example we model l/vaiQ[S' k (9o)] 
as exp[— |t — t k \ 2 /2X 2 ] for some unknown univariate nuisance parameter A, which is essentially plays the 
roll of a bandwidth parameter. To approximate the unbiasedness requirement J2k=i w kEe{S' k {9o)) = 
we notice that EQ{S' k {9o)) is a function of t,t±, . . . ,t k and converges to zero as t k — > t. Moreover we 
believe its behavior will depend mostly on t k and t so that the first order Taylor expansion in t k at t is 
Eg(S' k (9o)) « Ctfi{t — tk) for an unknown constant Ctfi. This is our motivation for the following variational 
characterization of the local likelihood weights: 



minimize 



E 

k=i 



wjfcexp 



[\t - t k \ 2 /2X 2 } subject to 



TJk=l w k = 1, 



(0,0) 5 



(27) 
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Notice that this particular variational characterization is essentially the same as in the nonparameteric 
regression setting (see [7]). 



Remark: It is interesting to note that the solution of (27) is in the form 

w k = [a + b-(t- t k )] exp[-|t - t k \/2\ 2 } 
where a, b are essentially the lagrange multipliers of the variational problem. Therefore the solution of w k 



in (27) can be computed quickly by inversion of a 2 x 2 matrix for each A. 
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Figure 6: Top left: The true local parameter function ut used in the simulation. Top right: The 5000 
observation locations. Bottom left: The estimate of the local parameter function v% on the grid of 500 
estimation locations. Bottom right: The error vt — vt- 

To test our estimate of v% we simulated one realization of a mean zero Gaussian random field at 5000 
random observation locations in [0, l] 2 using autocovariance function (26) where p = 0.5 and a = 1 are 
assumed to be known. We used the algorithm for Cholesky updating found in [2] to compute the log- 
likelihoods £(3sf^fc|i/) for k = 1, . . . , 500 (500 was chosen so the computations could be done in reasonable 
time). A plot of the local parameter function ut and the sampling locations are shown on the top row of 
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diagrams in Figure|6j To adjust for the fact that there is ~75% less data at the corners, and ~50% near the 
edges, we scaled the bandwidth by 2 near the corner and by y/2 near the edges (by linear interpolation). 
The bottom left plot of Figure [6]shows the estimate £>t on a square grid of 500 estimation locations in [0, l] 2 
using the bandwidth estimator A2. The right plot shows the bias: v% — vt- One can see that the estimate vt 
does a good job of recovering the main features of the true local parameter function v^. However, there is 
a positive bias all throughout the observation region with more bias in the region with small Vf 



Appendix 

In this appendix we show that the nonstationary covariance function (25) is positive definite. Originally 
presented in [15] , the proof combines the results found in [10] , [9] for generating a spatially varying geometric 
anisotropy with those found in [12] for generating a local isotropic Matern with spatially varying smoothness 
parameter. We start by stating the following lemma which is proved using a convolution argument found 
in Paciorek's thesis [8] on, p. 27 . 

Lemma 1. Let at map t G IR^ to the set of real d x d positive definite matrices (denoted -P-Dd(R) ) and let 
<f){u) = exp(— |n| 2 /2) Then 

det(a~ 1/2 )0[a~ 1/2 (s-t)/cr] = c t c s [ </>[a; 1/2 (u - s)/a]<f)[a; 1/2 (u - t)/a]du (28) 

Jm. d 

where a st = (a s + a t )/2 and c s = (2ix)- d l A o-- d l 2 det(a~ 1/2 ). 



The importance of this lemma is that the right hand side of (28) is positive definite with locally varying 

— 1/2 J I 

geometric anisotropy a t . This follows since the right hand side of equation (28 ) equals cov(Z,j(s), Z a (t)) 
where Z a (t) = ct j^ d (j)\a t 1 ^ 2 (u — t)/a] dW(u) and dW is Gaussian white noise. Now let K a (s,t) = 

det(a st 1//2 )</>[a st 1/,2 (s — t)/a\ and notice that for any function g a {-) the function g (7 (s)g a (t)K a -(s, t) is posi- 
tive definite. Since convex combinations and limits of positive definite functions are again positive definite 
we have that 

/•oo 

/ ga{s) gcT (t)K a (s,t)dfi(a) (29) 
Jo 

is positive definite for any positive finite measure /i such that f Rd g^ityKait, t)dfi(a) < 00 for all t. 

Claim 1. Let a t : M. d -> M + , v t : M. d -> M + and a t : M. d PD d (R) and define a st = (a t + a,)/2, v st = 
(yt + u a )/2. Then 

R(s,t) = o- t o- s det(a~ t 1/2 )M U3t (|a^ 1/2 (s - t) |) 

is positive definite on where l\[ v (x) = x v % v {x) and % v is the modified Bessel function of the second 
kind of order v > (see JlJ/j. 
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Proof. Let g a {t) = {^Y'^ 1 and ji have density ^-e °" 2 / 2 with respect to Lebesque measure. Then 
equation (29) becomes 



— 1/2 

9a{s)ga(t)K a (s,t)dn(a) = det(a st ) 

x=a 2 /2 



oo / a 2 \ "at— 1 



det(a 



1/2, 



exp 



x" s! 1 exp 



K"/ /2 (^-t)| 2 a 2 
4(<7 2 /2) 2 

i«;/ /2 (^-t)i 2 

4a; 



da 



x 



dx 



= 2det(a;/ / V^ /2 2^ t/ W /2 (s " t)r st ^ 3t (|a- 1/2 (, - 01) 

where the last line is from (3.472.9) of Gradshteyn and Ryzhik [6]. Therefore det(a s ^ 2 )3v[ U3t (\a s ^ 2 (s — t)\) 
is positive definite. □ 
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